library(tidyverse)
library(lubridate)
library(plotly)
library(gridExtra)
library(kableExtra)
library(scales)
options(scipen = 999) #0 to reset to default
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE, fig.width = 10, fig.height = 3)

What this project entails and what it hopes to achieve

The Global Games Market reached $175 billion USD in revenue in 2020 and an additional 9.4% growth is projected by Newzoo, a leading market research company. Although PC gaming occupy the lowest share (22%) in 2020, the pandemic has accelerated the PC market share due to the new work from home regime. Coupled with the rising popularity of E-Sports, PC gaming is set to grow as high as 25% in the next 3 years.


Image 1: Global Games Market 2020


Furthermore, the annual report published by Steam for 2020 reported not only increased number of users, but up to 50% more playtime.

Image 2: Steam Year in Review 2020


As I am from the video games industry, and we are looking to grow our PC market share, I hope to uncover some key insights to help plan our strategy such as:



Selecting the dataset and defining the variables

The main three datasets I have selected for this study is a webscrapped version of the steamcharts/steamsdb website. These data from these sites are pulled via public Steam APIs, so there’s trustworthiness in the accuracy there.


Dataset 1: Monthly users for top 1,200 games from 2012 to 2020
Source - Kaggle

steam_monthly <- read_csv("datafiles/SteamCharts.csv", col_types = 'cccdddd')
head(steam_monthly)

gamename - name of game title in Steam store
avg - average number of daily concurrent users for a specific game in that month/year
gain - delta of avg from previous month
peak - highest number of daily concurrent users for a specific game in the month/year
avg_peak_perc - % of avg over peak


Dataset 2: Total Steam Peak CCU and Total in-game Peak CCU from 2012 to 2020
Source - SteamDB

steam_total <- read_csv("datafiles/SteamCharts_Total.csv", col_types = 'ccdddddd')
head(steam_total)

peak - highest number of daily concurrent users for Steam (gaming + idle) in the month/year
gain_perc - delta of peak from previous month
peak_daily_min - lowest number of daily concurrent users for Steam (gaming + idle) in the month/year
peak_daily_avg - average number of daily concurrent users for Steam (gaming + idle) in that month/year
peak_ingame - highest number of daily concurrent users who are playing a game in that month/year


Dataset 3: Positive and Negative ratings for each game in 2020
Source - SteamDB

steam_reviews <- read_csv("datafiles/SteamCharts_Review.csv", col_types = 'cddd')
head(steam_reviews)

gamename - name of game title in Steam store
postive - total number of positive reviews to date
negative -total number of positive reviews to date
rating - % of positive reviews over total reviews, adjusted for games with less than 100 reviews



Exploratory data analysis

The initial loading of the data show separate columns for year and month, and this would prevent any form of time-series visualization. Thus, the first step of data cleaning will be converting to a proper date and dropping the two columns (year, month) for both steam_monthly and steam_total data frames.

steam_monthly <- steam_monthly %>%
                    mutate(datemonth = dmy(paste(1, month, year)), .before = gamename) %>%
                    select(-c(year, month))

steam_total <- steam_total %>%
                    mutate(datemonth = dmy(paste(1, month, year)), .before = month) %>%
                    select(-c(year, month))


1. Summary statistics for steam_monthly

Based on the summary statistics, the data covers from Jul 2012 to Feb 2021 but not all games will have the full range. There’s some problems with the data, particular the zero values in avg and peak variables. Furthermore, it seems the data is heavily skewed, with the maximum value being a lot larger than the median. There is also 1,252 NA values in gain, which could indicate games with missing month data. NAs in avg_peak_perc are caused by zero values in peak, as division by zero will give you an error. In the context of tracking concurrent users (CCU), having zero values means either your game is being played without an internet connection, or Steam has blocked traffic from client side due to cheating/hacking/game files modification.

summary(steam_monthly)
##    datemonth            gamename              avg           
##  Min.   :2012-07-01   Length:83620       Min.   :      0.0  
##  1st Qu.:2016-03-01   Class :character   1st Qu.:     53.3  
##  Median :2018-02-01   Mode  :character   Median :    204.0  
##  Mean   :2017-10-17                      Mean   :   2795.4  
##  3rd Qu.:2019-09-01                      3rd Qu.:    770.1  
##  Max.   :2021-02-01                      Max.   :1584886.8  
##                                                             
##       gain                peak         avg_peak_perc   
##  Min.   :-250249.0   Min.   :      0   Min.   :0.0000  
##  1st Qu.:    -38.3   1st Qu.:    138   1st Qu.:0.3450  
##  Median :     -1.6   Median :    501   Median :0.4386  
##  Mean   :     -9.9   Mean   :   5524   Mean   :0.4248  
##  3rd Qu.:     22.3   3rd Qu.:   1736   3rd Qu.:0.5140  
##  Max.   : 426446.1   Max.   :3236027   Max.   :0.8884  
##  NA's   :1252                          NA's   :156
paste("There are a total of", n_distinct(steam_monthly$gamename), "distinct game titles in the data frame.")
## [1] "There are a total of 1251 distinct game titles in the data frame."


If we look at the top 20, it is very evident why the distribution is skewed so. The top 3 games - PUBG, Dota 2 and Counter-Strike, have trends that are very far from the rest of the top 20. Even within the top 3, PUBG had such a huge spike in the initial few months before declining.

steam_games_summary <- filter(steam_monthly, avg > 0) %>%
                            group_by(gamename) %>%
                            summarize(avg_avg = mean(avg),
                                      avg_med = median(avg),
                                      avg_min = min(avg),
                                      avg_max = max(avg),
                                      avg_sum = sum(avg),
                                      min_month = min(datemonth),
                                      max_month = max(datemonth),
                                      n = n()) %>%
    mutate(no_of_months = interval(min_month, max_month) %/% months(1) + 1) %>%
    relocate(min_month:no_of_months, .before = avg_avg) %>%
    arrange(desc(avg_avg))

steam_monthly$gamename <- factor(as.factor(steam_monthly$gamename), levels = steam_games_summary$gamename, ordered = TRUE)

filter(steam_monthly, gamename %in% pull(steam_games_summary %>% head(n = 20), 1)) %>%
ggplot() +
  geom_line(aes(x = datemonth, y = avg, color = gamename), size = 0.2) +
  scale_linetype_manual(values = c("TRUE" = "solid", "FALSE" = "dotted")) +
  labs(title = "Monthly CCU for top 20 games", color = "Descending order", y = "average CCU") +
  guides(col = guide_legend(nrow = 10)) +
  scale_color_manual(values = rainbow(20))


The confidence interval for average CCU hovers near the median of 2795, and if we look at the distribution spread (by bins of 2,000), you can see a huge spike in records that fall below 2000. Upon adjusting the bins and having a zoomed in look on the data, there seems to be an unusually large amount of records (41k, which is almost half of the dataset) that fall below 200 average CCU. Putting into context that this dataset represents the top 1,200 games, these records could be data errors or early access to the game before it was launched. Plus, if you look at the actual values, it’s not a whole number either so it might very well be an estimation from the server logs after all.

cat("The 95% CI for [avg] is", paste(round(confint(lm(avg ~ 1, steam_monthly, level = 0.95))), collapse=" - "))
## The 95% CI for [avg] is 2616 - 2975
ggplot(steam_monthly, aes(x = avg)) +
  geom_histogram(binwidth = 2000) +
  labs(title = "Distribution of average CCU by bins of 2,000", subtitle = "Full data view")

ggplotly(
ggplot(steam_monthly, aes(x = avg)) +
  geom_histogram(binwidth = 100) +
  coord_cartesian(xlim = c(0, 5000)) +
  labs(title = "Changed bins to 100, zoomed into more granularity")
)
filter(steam_monthly, avg > 1 & avg < 200) %>%
  arrange(avg) %>%
  head()


On the other hand, there also doesn’t seem to be any impact to CCU with more months after launch as seen from the correlation number.

ggplot(steam_games_summary) +
  geom_point(aes(x = n, y = avg_avg), alpha = 0.5) +
  geom_text(aes(x = n, y = avg_avg, label = ifelse(gamename %in% pull(steam_games_summary %>% head(n = 3), 1), as.character(gamename), "")), hjust = 1, vjust = 1) +
  labs(title = "Avg CCU vs Number of Months for all games", x = "number of months", y = "average CCU")

paste("The correlation between avg and number_of_months is", cor(steam_games_summary$avg_avg, steam_games_summary$n))
## [1] "The correlation between avg and number_of_months is -0.0351747370434565"


Games with missing months

As seen below, there are 75 games (about 6%) with irregular month data. The top game (Mount & Blade II: Bannerlord) in this list has 39 missing months from the data, which basically renders it sub-par when it comes to time-series plotting.

Looking at the line charts, it becomes very evident that these games have an initial spike in numbers and regulated in the following months. This is not unusual in this industry as the first month is where the user tries out the game and decide if they are going to continue to play. However, the data is a little odd for some games here (in particular Killing Floor 2, Stellaris and SCUM). There’s a trailing almost-zero data point prior to the first spike, and this could be due to data errors or early access to the game (like in the previous histogram EDA). We would definitely need to remove these values before realigning each starting month so we can compare games (i.e. launch month success, first 6 months, etc).

steam_games_summary %>%
  filter(n != no_of_months)
filter(steam_monthly, gamename %in% pull(steam_games_summary %>% filter(n != no_of_months) %>% head(n=2), 1)) %>%
ggplot(aes(x = datemonth, y = avg, color = gamename)) +
  geom_line() + geom_point() + labs(title = "Top 2 games with missing data", y = "average CCU") + 
  scale_color_manual(values = rainbow(2))

 filter(steam_monthly, gamename %in% pull(steam_games_summary %>% filter(n != no_of_months) %>% head(n=10) %>% tail(n=8), 1)) %>%
ggplot(aes(x = datemonth, y = avg, color = gamename)) +
  geom_line() + geom_point() + labs(title = "Top #3 - #10 games with missing data", y = "average CCU") + 
  scale_color_manual(values = rainbow(8))


Final cleanup for [steam_monthly]

Based on the EDA, we can modify the steam_monthly dataframe to exclude average CCU less than 200, and add a new column that calculates the number of months from launch so that we can do a comparison of games.

steam_monthly <- filter(steam_monthly, avg >= 200)

steam_games_summary <- group_by(steam_monthly, gamename) %>%
                            summarize(avg_avg = mean(avg),
                                      avg_med = median(avg),
                                      avg_min = min(avg),
                                      avg_max = max(avg),
                                      avg_sum = sum(avg),
                                      min_month = min(datemonth),
                                      max_month = max(datemonth),
                                      n = n()) %>%
                            mutate(no_of_months = interval(min_month, max_month) %/% months(1) + 1) %>%
                            relocate(min_month:no_of_months, .before = avg_avg) %>%
                            arrange(desc(avg_avg))

steam_monthly <- steam_monthly %>% 
                    left_join(select(steam_games_summary, c(gamename, min_month))) %>%
                    mutate(mths_after_launch = interval(min_month, datemonth) %/% months(1))

ggplotly(
filter(steam_monthly, gamename %in% pull(steam_games_summary %>% head(n = 5), 1)) %>%
  ggplot() +
  geom_line(aes(x = mths_after_launch, y = avg, color = gamename), size = 0.2) +
  geom_point(aes(x = mths_after_launch, y = avg, color = gamename)) +
  coord_cartesian(xlim = c(0,23)) +
  scale_x_continuous(breaks = seq(0, 1000, by = 1)) +
  scale_linetype_manual(values = c("TRUE" = "solid", "FALSE" = "dotted")) +
  labs(title = "Top 5 games monthly average CCU", subtitle = "First 2 years after launch", color = "Descending order", y = "average CCU") +
  guides(col = guide_legend(nrow = 10)) #+
  #scale_color_manual(values = rainbow(20))
)
ggplotly(
filter(steam_monthly, gamename %in% pull(steam_games_summary %>% head(n = 15) %>% tail(n = 10), 1)) %>%
  ggplot() +
  geom_line(aes(x = mths_after_launch, y = avg, color = gamename), size = 0.2) +
  geom_point(aes(x = mths_after_launch, y = avg, color = gamename)) +
  coord_cartesian(xlim = c(0,23)) +
  scale_x_continuous(breaks = seq(0, 1000, by = 1)) +
  scale_linetype_manual(values = c("TRUE" = "solid", "FALSE" = "dotted")) +
  labs(title = "Top 6-15 games monthly average CCU", subtitle = "First 2 years after launch", color = "Descending order", y = "average CCU") +
  guides(col = guide_legend(nrow = 10)) #+
  #scale_color_manual(values = rainbow(20))
)


2. Summary statistics for steam_total

Looking at the summary statistics for total steam data, we can see that the date stretches back further than the previous dataset, covering Jan 2004 to Mar 2021. If we look at the four CCU metrics, peak, the max (peak), min (peak_daily_min) and average (peak_daily_avg) have very similar distributions with high outliers especially at the tail end. For peak_ingame, the data seems to be incomplete as there must be a large amount of zero values for the median to be zero as well.

summary(steam_total)
##    datemonth               peak               gain            gain_perc       
##  Min.   :2004-01-01   Min.   :   84998   Min.   :-1880626   Min.   :-0.34000  
##  1st Qu.:2009-01-16   1st Qu.: 1610970   1st Qu.:  -51752   1st Qu.:-0.01900  
##  Median :2013-02-01   Median : 5667775   Median :   44679   Median : 0.01700  
##  Mean   :2012-12-30   Mean   : 8035487   Mean   :  137261   Mean   : 0.03701  
##  3rd Qu.:2017-02-15   3rd Qu.:13705385   3rd Qu.:  304916   3rd Qu.: 0.05950  
##  Max.   :2021-03-01   Max.   :26850811   Max.   : 4527398   Max.   : 1.57200  
##  peak_daily_min     peak_daily_avg      peak_ingame     
##  Min.   :   84998   Min.   :   84998   Min.   :      0  
##  1st Qu.: 1502789   1st Qu.: 1555473   1st Qu.:      0  
##  Median : 4904558   Median : 5233979   Median :      0  
##  Mean   : 6808661   Mean   : 7323593   Mean   :1262207  
##  3rd Qu.:11647085   3rd Qu.:12480206   3rd Qu.:      0  
##  Max.   :24912223   Max.   :25653203   Max.   :8171592


If we plot out the variables in a line chart, we can see immediately that the outliers are simply a result of a fast growing trend. However for peak_ingame, there is only data starting somewhere in 2017. Even so, there is still a strong (> 0.8) positive correlation between those variables which isn’t suprising as you would expect number of people actually playing games will increase as the number of people online goes up.

chart1a <- ggplot(data = steam_total, aes(x = datemonth)) +
            geom_line(aes(y = peak, color = "Peak CCU (Max)")) + 
            geom_point(aes(y = peak, color = "Peak CCU (Max)")) + 
            geom_line(aes(y = peak_daily_min, color = "Peak CCU (Min)")) + 
            geom_point( aes( y = peak_daily_min, color = "Peak CCU (Min)")) + 
            geom_line(aes(y = peak_daily_avg, color = "Peak CCU (Average)")) +
            geom_point(aes( y = peak_daily_avg, color = "Peak CCU (Average)")) +
            theme(legend.title = element_blank(), axis.title.y = element_blank(), legend.position = "bottom") +
            labs(title = "Min/Avg/Max CCU (Gaming + Idle) by month")

chart1b <- ggplot(steam_total, aes(x = datemonth, y = peak_ingame, color = "Peak In-Game CCU (Average Daily)")) + 
            geom_line() + 
            geom_point() +
            theme(legend.title = element_blank(), axis.title.y = element_blank(), legend.position = "bottom") +
            labs(title = "Avg CCU (Gaming) by month")

grid.arrange(chart1a, chart1b, ncol = 2)

paste("The correlation between peak_daily_avg and peak_ingame is", cor(steam_total$peak_daily_avg, steam_total$peak_ingame))
## [1] "The correlation between peak_daily_avg and peak_ingame is 0.819851024823542"
paste("The correlation between peak_daily_min and peak_ingame is", cor(steam_total$peak_daily_min, steam_total$peak_ingame))
## [1] "The correlation between peak_daily_min and peak_ingame is 0.81818753803157"
paste("The correlation between peak_daily_max and peak_ingame is", cor(steam_total$peak, steam_total$peak_ingame))
## [1] "The correlation between peak_daily_max and peak_ingame is 0.816194954412985"


The avg from steam_monthly data frame is a subset of peak_ingame in this data frame, so it is crucial to understand where the date range overlaps. As seen below, peak_ingame only has data starting from Nov-2017, but there is also a strong (> 0.9) positive correlation between the CCUs in the two data frames. So we could technically look at using linear model to predict the numbers prior to Nov-2017.

steam_monthly_agg <- group_by(steam_monthly, datemonth) %>%
  filter(avg > 0) %>%
  summarize(avg_mean = mean(avg),
            avg_max = max(avg))

  ggplot() + 
  geom_line(data = steam_monthly_agg, aes(x = datemonth, y = avg_max, color = "Max CCU (steam_monthly)")) +
  geom_point(data = steam_monthly_agg, aes(x = datemonth, y = avg_max, color = "Max CCU (steam_monthly)")) +
  geom_line(data = steam_total, aes(x = datemonth, y = peak_ingame, color = "In-game (steam_total)")) +
  geom_point(data = steam_total, aes(x = datemonth, y = peak_ingame, color = "In-game (steam_total)")) +
  geom_text(data = steam_total, aes(x = datemonth, y = peak_ingame, label = ifelse(datemonth == as.Date("2017-11-01"), "Nov-2017", "")), hjust = 1, vjust = -0.5) +
    scale_x_date(date_breaks = "4 months", date_labels = "%b-%y") + 
    coord_cartesian(xlim = as.Date(c("2017-01-01", "2021-03-01"))) +
    theme(legend.title = element_blank(), axis.title.y = element_blank()) +
    labs(title = "Date range for CCU in both data frames")

steam_ccu_corr <- bind_cols(
                            filter(steam_monthly_agg, datemonth > as.Date("2017-10-01"), datemonth < as.Date("2021-03-01")) %>%
                              arrange(datemonth),
                            select(steam_total, c(datemonth, peak_ingame)) %>%
                              filter(datemonth > as.Date("2017-10-01"), datemonth < as.Date("2021-03-01")) %>%
                              arrange(datemonth)
                            )

paste("The correlation between avg (steam_monthly) and peak_ingame (steam_total) for the same period is ", cor(steam_ccu_corr$avg_mean, steam_ccu_corr$peak_ingame))
## [1] "The correlation between avg (steam_monthly) and peak_ingame (steam_total) for the same period is  0.792426681437856"
paste("The correlation between max (steam_monthly) and peak_ingame (steam_total) for the same period is ", cor(steam_ccu_corr$avg_max, steam_ccu_corr$peak_ingame))
## [1] "The correlation between max (steam_monthly) and peak_ingame (steam_total) for the same period is  0.621447730479637"


3. Summary statistics for steam_reviews

There’s a total of 42,836 unique games and it could also mean this data frame could be containing games that are no longer active. There are zero values for all numeric variables, but it is not surprising to see a game with no positive or negative reviews. However, there is 9 games with zero values in both positive and negative reviews, so we should exclude those as they serve no purpose statistically. There isn’t any NA recorded in the dataset, but the distribution seems very right-tailed with huge outliers.

summary(steam_reviews)
##    gamename            positive          negative            rating      
##  Length:42836       Min.   :      0   Min.   :     0.0   Min.   :0.0000  
##  Class :character   1st Qu.:      6   1st Qu.:     2.0   1st Qu.:0.5610  
##  Mode  :character   Median :     22   Median :     7.0   Median :0.6675  
##                     Mean   :   1344   Mean   :   226.9   Mean   :0.6526  
##                     3rd Qu.:    123   3rd Qu.:    37.0   3rd Qu.:0.7582  
##                     Max.   :4830984   Max.   :779666.0   Max.   :0.9762
paste("There are a total of", n_distinct(steam_reviews$gamename), "distinct game titles in the data frame.")
## [1] "There are a total of 42836 distinct game titles in the data frame."
filter(steam_reviews, positive == 0 & negative == 0)


Looking at the charts below, the skewed distribution becomes very apparent. Most of the games have reviews within the IQR range (0 to 1000). It is also worth noticing that the maximum number of positives is about 6 times higher than the maximum number of negatives. There seems to be a strong (> 0.7) positive correlation between positive and negative reviews, but we can’t really conclude anything without some time series review data for each game. Furthermore, if we look at the aggregrated rating variables, you can see the distribution is mostly centered around the 60-70% range and most games have at least 25% rating.

paste("The correlation between positive and negative reviews is ", cor(steam_reviews$positive, steam_reviews$negative))
## [1] "The correlation between positive and negative reviews is  0.774891317716778"
chart2a <- filter(steam_reviews, positive != 0) %>%
              ggplot(aes(positive)) + geom_histogram(fill = "darkgreen") +
              theme(axis.title.y = element_blank()) +
              labs(title = "Distribution of games with positive reviews") +
              scale_y_continuous(limits = c(0, 50000))

chart2b <- filter(steam_reviews, positive != 0) %>%
              filter(positive <= 500) %>%
              ggplot(aes(positive)) + geom_histogram(fill = "darkgreen", binwidth = 10) +
              theme(axis.title.y = element_blank()) +
              labs(title = "Distribution of games with < 500 positive reviews") +
              scale_y_continuous(limits = c(0, 15000)) +
              coord_cartesian(xlim = c(0, 500))

chart2c <- filter(steam_reviews, negative != 0) %>%
              ggplot(aes(negative)) + geom_histogram(fill = "darkred") +
              theme(axis.title.y = element_blank()) +
              labs(title = "Distribution of games with negative reviews") +
              scale_y_continuous(limits = c(0, 50000))

chart2d <- filter(steam_reviews, negative != 0) %>%
              filter(negative <= 500) %>%
              ggplot(aes(negative)) + geom_histogram(fill = "darkred", binwidth = 10) +
              theme(axis.title.y = element_blank()) +
              labs(title = "Distribution of games with < 500 negative reviews") +
              scale_y_continuous(limits = c(0, 15000)) +
              coord_cartesian(xlim = c(0, 500))

grid.arrange(chart2a, chart2b, chart2c, chart2d, ncol = 2, nrow = 2)

ggplot(steam_reviews, aes(rating)) + 
    geom_histogram(color = "white", fill = "darkblue", binwidth = 0.1) +
    stat_bin(binwidth = 0.1, geom="text", aes(label = ..count..), vjust = -0.5) +
    expand_limits(y = c(0,14000)) +
    scale_x_continuous(breaks = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0)) +
    theme(axis.title.y = element_blank()) +
    labs(title = "Distribution of games by ratings (%)")


If we filter to the games that are present in the steam_monthly data frame, the distribution doesn’t seem that different. However, if we look at the aggregrated rating, you will see most of these 1218 games have around 80-90% rating and the minimum is sits at 50% or more. This adds another layer of confidence that these games are indeed the cream of the crop. However if we test the correlation of reviews against the CCU data, it seems that negative reviews have a higher correlation then positive review, and the aggregrate ratings have almost zero correlation.

steam_reviews_ccu <- inner_join(steam_reviews, steam_games_summary, by = "gamename")

paste("The total number of games with matching reviews is", n_distinct(steam_reviews_ccu$gamename), "/", n_distinct(steam_monthly$gamename))
## [1] "The total number of games with matching reviews is 1088 / 1118"
summary(steam_reviews_ccu)
##    gamename            positive          negative            rating      
##  Length:1088        Min.   :     84   Min.   :    34.0   Min.   :0.3021  
##  Class :character   1st Qu.:   7090   1st Qu.:   953.8   1st Qu.:0.7825  
##  Mode  :character   Median :  13186   Median :  1986.0   Median :0.8601  
##                     Mean   :  41005   Mean   :  6298.0   Mean   :0.8335  
##                     3rd Qu.:  31531   3rd Qu.:  4295.0   3rd Qu.:0.9079  
##                     Max.   :4830984   Max.   :779666.0   Max.   :0.9762  
##    min_month            max_month                n          no_of_months   
##  Min.   :2012-07-01   Min.   :2012-07-01   Min.   :  1.0   Min.   :  1.00  
##  1st Qu.:2013-12-01   1st Qu.:2019-02-01   1st Qu.:  8.0   1st Qu.: 17.00  
##  Median :2015-12-01   Median :2021-02-01   Median : 29.0   Median : 43.00  
##  Mean   :2016-01-14   Mean   :2019-11-05   Mean   : 37.3   Mean   : 46.69  
##  3rd Qu.:2017-12-01   3rd Qu.:2021-02-01   3rd Qu.: 61.0   3rd Qu.: 72.00  
##  Max.   :2021-01-01   Max.   :2021-02-01   Max.   :104.0   Max.   :104.00  
##     avg_avg            avg_med            avg_min           avg_max         
##  Min.   :   203.3   Min.   :   200.9   Min.   :  200.0   Min.   :    203.3  
##  1st Qu.:   365.2   1st Qu.:   302.6   1st Qu.:  206.3   1st Qu.:    785.3  
##  Median :   601.2   Median :   475.8   Median :  238.6   Median :   1861.2  
##  Mean   :  3693.4   Mean   :  3212.5   Mean   : 1108.6   Mean   :  10676.7  
##  3rd Qu.:  1539.4   3rd Qu.:  1208.6   3rd Qu.:  519.8   3rd Qu.:   5702.3  
##  Max.   :482024.9   Max.   :475924.0   Max.   :64469.8   Max.   :1584886.8  
##     avg_sum        
##  Min.   :     203  
##  1st Qu.:    4518  
##  Median :   21668  
##  Mean   :  207444  
##  3rd Qu.:   68410  
##  Max.   :47188121
chart3a <- filter(steam_reviews_ccu, positive != 0) %>%
              ggplot(aes(positive)) + geom_histogram(fill = "darkgreen") +
              theme(axis.title.y = element_blank()) +
              labs(title = "Distribution of games with positive reviews") +
              scale_y_continuous(limits = c(0, 50000))

chart3b <- filter(steam_reviews_ccu, positive != 0) %>%
              filter(positive <= 500) %>%
              ggplot(aes(positive)) + geom_histogram(fill = "darkgreen", binwidth = 10) +
              theme(axis.title.y = element_blank()) +
              labs(title = "Distribution of games with < 500 positive reviews") +
              scale_y_continuous(limits = c(0, 15000)) +
              coord_cartesian(xlim = c(0, 500))

chart3c <- filter(steam_reviews_ccu, negative != 0) %>%
              ggplot(aes(negative)) + geom_histogram(fill = "darkred") +
              theme(axis.title.y = element_blank()) +
              labs(title = "Distribution of games with negative reviews") +
              scale_y_continuous(limits = c(0, 50000))

chart3d <- filter(steam_reviews_ccu, negative != 0) %>%
              filter(negative <= 500) %>%
              ggplot(aes(negative)) + geom_histogram(fill = "darkred", binwidth = 10) +
              theme(axis.title.y = element_blank()) +
              labs(title = "Distribution of games with < 500 negative reviews") +
              scale_y_continuous(limits = c(0, 15000)) +
              coord_cartesian(xlim = c(0, 500))

grid.arrange(chart2a, chart2b, chart2c, chart2d, ncol = 2, nrow = 2)

ggplot(steam_reviews_ccu, aes(rating)) + 
    geom_histogram(color = "white", fill = "darkblue", binwidth = 0.1) +
    stat_bin(binwidth = 0.1, geom="text", aes(label = ..count..), vjust = -0.5) +
    expand_limits(y = c(0,800)) +
    scale_x_continuous(breaks = c(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0)) +
    theme(axis.title.y = element_blank()) +
    labs(title = "Distribution of games by ratings (%)")

paste("The correlation between avg CCU and positive reviews is", cor(steam_reviews_ccu$avg_avg, steam_reviews_ccu$positive))
## [1] "The correlation between avg CCU and positive reviews is 0.660403967199178"
paste("The correlation between avg CCU and negative reviews is", cor(steam_reviews_ccu$avg_avg, steam_reviews_ccu$negative))
## [1] "The correlation between avg CCU and negative reviews is 0.869849107940706"
paste("The correlation between avg CCU and ratings is", cor(steam_reviews_ccu$avg_avg, steam_reviews_ccu$rating))
## [1] "The correlation between avg CCU and ratings is -0.0374698310814259"


But if we plot positive and negative as a scatter plot, you can see that the correlation for negative is heavily skewed due to PUBG having low positive and high negative, and yet having high CCU too.

ggplotly(
ggplot(steam_reviews_ccu) +
  geom_point(aes(x = negative, y = positive, color = gamename, size = avg_avg))
)


So does positive, negative and ratings impact average CCU?

If we simply ran a linear model for avg CCU on positive, negative and ratings; we have a high 0.7605 r-squared explaining the dependant variable, and both negative and rating are significant in their p-values. But we know that this is due to PUBG skewing the data.

lm(avg_avg ~ positive + negative + rating, steam_reviews_ccu, levels = 0.95) %>% summary()
## 
## Call:
## lm(formula = avg_avg ~ positive + negative + rating, data = steam_reviews_ccu, 
##     levels = 0.95)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78541  -1426   -678    423 311624 
## 
## Coefficients:
##                  Estimate    Std. Error t value             Pr(>|t|)    
## (Intercept) -12561.569398   2951.891627  -4.255            0.0000227 ***
## positive        -0.006123      0.003307  -1.852               0.0643 .  
## negative         0.639504      0.016896  37.850 < 0.0000000000000002 ***
## rating       14971.164606   3523.510526   4.249            0.0000233 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11440 on 1084 degrees of freedom
## Multiple R-squared:  0.7608, Adjusted R-squared:  0.7601 
## F-statistic:  1149 on 3 and 1084 DF,  p-value: < 0.00000000000000022


So if we exclude the top 3 games - PUBG, Counterstrike and Dota, and run the linear model again; immediately we see a lower residual estimates and our standard error drops by half. And although positive and negative are both significant now, the R-squared drops to about 0.5; which is about as good as a coin toss. Based on that, we can only conclude that this reviews data frame have no significant impact on our CCU data.

steam_reviews_ccu_no_top_3 <- filter(steam_reviews_ccu, !(gamename %in% pull(steam_games_summary %>% head(n=3), 1)))

lm(avg_avg ~ positive + negative + rating, steam_reviews_ccu_no_top_3, levels = 0.95) %>% summary()
## 
## Call:
## lm(formula = avg_avg ~ positive + negative + rating, data = steam_reviews_ccu_no_top_3, 
##     levels = 0.95)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -26024   -986   -336    199 105498 
## 
## Coefficients:
##                 Estimate   Std. Error t value            Pr(>|t|)    
## (Intercept) -3417.167051  1578.287262  -2.165              0.0306 *  
## positive        0.031376     0.003698   8.485 <0.0000000000000002 ***
## negative        0.325766     0.024540  13.275 <0.0000000000000002 ***
## rating       3992.590871  1886.739241   2.116              0.0346 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5562 on 1081 degrees of freedom
## Multiple R-squared:  0.5027, Adjusted R-squared:  0.5013 
## F-statistic: 364.2 on 3 and 1081 DF,  p-value: < 0.00000000000000022



So… what’s next?

Based on the EDA, we can already draw some conclusions about the trends.

The next steps would be to search for other possible predictive datasets like genres, gametime, player demographics, etc.